TECHNICAL FIELD

The present invention relates to a data processing system, and more
particularly to a turbo decoder and a turbo interleaver.

BACKGROUND OF THE INVENTION 

Data signals, in particular those transmitted over a typically hostile RF
interface (communication channel), are susceptible to errors (channel noise)
introduced by that interface. Various methods of error correction coding have been
developed to minimize the adverse effects that a hostile interface has on the
integrity of communicated data. This is also referred to as lowering the Bit Error
Rate (BER), which is generally defined as the ratio of incorrectly received
information bits to the total number of received information bits. Error correction
coding generally involves representing digital data in ways designed to be robust
with respect to bit errors. Error correction coding enables a communication system
to recover original data from a signal that has been corrupted.

Error correction codes include convolutional codes and parallel concatenated
convolutional codes (so-called turbo codes). A convolutional code transforms an
input sequence of bits into an output sequence of bits through the use of a
finite-state machine, where additional bits are added to the data stream to allow
for error-correction capability. To increase the error-correction capability, the
number of additional bits and the amount of memory present in the finite-state
machine must be increased, which increases decoding complexity.

In a turbo coding system, a block of data may be encoded with a particular
coding method, resulting in systematic bits and two sets of parity bits. Additionally,
the original block of input data may be rearranged with an interleaver and then
encoded with the same method as that applied to the original input data. The encoded
data (systematic bits and parity bits) are combined in some manner to form a serial
bit stream and transmitted through the communication channel to a turbo decoding
system. The turbo decoding system operates on noisy versions of the systematic bits
and the two sets of parity bits in two decoding stages to produce an estimate of the
original message bits. The turbo decoding system uses iterative decoding algorithms
and consists of interleaver and deinterleaver stages, individually matched to
constituent decoding stages. The decoding stages of the turbo decoding system use
the BCJR algorithm, which was originally invented by Bahl, Cocke, Jelinek, and Raviv
(hence the name) to solve the maximum a posteriori probability (MAP) detection
problem. The BCJR algorithm is a MAP decoder in that it minimizes the bit errors by
estimating the a posteriori probabilities of the individual bits in a code word; to
reconstruct the original data sequence, the soft outputs of the BCJR algorithm are
hard-limited. The decoding stages exchange the obtained soft output information with
each other, and the decoding iterations cease when a satisfactory estimate of the
transmitted information sequence has been achieved.

As turbo codes offer extremely impressive performance, very close to the Shannon
capacity limit, 3G mobile radio systems such as W-CDMA and cdma2000 have adopted
them for channel coding.

3G wireless systems support variable bit rates, which may require full
reconstruction of the turbo interleaver at every 10 ms or 20 ms frame. Generating
the whole interleaved address pattern at once consumes much time and requires a
large RAM to store the pattern.

Accordingly, a high-speed turbo interleaver that can support variable bit rates
and does not affect the performance of the turbo coder is required.

As is well known, W-CDMA and cdma2000 differ in coding rate and interleaver. For
example, the coding rate of W-CDMA is 1/3 but that of cdma2000 is 1/2, 1/3, 1/4 or
1/5; the frame size of W-CDMA is an arbitrary integer between 40 and 5114 but that
of cdma2000 is one of twelve numbers 378, 570, 762, ..., and 20730; and the number
of rows of the block interleaver is 5, 10, or 20 in W-CDMA but 32 (8 to 14 of them
unused) in cdma2000.

Accordingly, flexible and programmable decoders are required for 3G
communication, because global roaming is recommended between different 3G
standards and the frame size may change from frame to frame.


SUMMARY OF THE INVENTION 
Embodiments of the present invention provide an interleaver. The interleaver
comprises a preprocessing means for preparing seed variables and an address
generation means for generating interleaved addresses on the fly using the seed
variables. The seed variables take the form of column vectors whose number of
elements equals the number of rows of the two-dimensional block interleaver.
Consequently, the number of seed variables is less than the size of the data block.
If a generated interleaved address is larger than the size of the data block, the
generated interleaved address is discarded.

In some embodiments, the seed variables include a base column vector, an
increment column vector and a cumulative column vector. The number of elements of
each of the three column vectors is equal to the number of rows of the interleaver
block. After the interleaved addresses for one column are generated by adding the
base column vector and the cumulative column vector, the cumulative column vector
is updated by adding the increment column vector to the old cumulative column
vector. When the cumulative column vector is updated, any element of the updated
cumulative column vector that is larger than the number of columns of the data
block is reduced by the number of columns of the data block.

Elements of the base column vector and the increment column vector are
inter-row permuted.

Embodiments of the present invention provide a turbo decoding system. The
turbo decoding system comprises an interleaver comprising a preprocessing means for
preparing seed variables and an address generation means for generating interleaved
addresses using the seed variables, an address queue for storing the generated
interleaved addresses equal to or smaller than the interleaver size, an SISO decoder
performing recursive decoding and calculating a log likelihood ratio, and an LLR
memory connected to the SISO decoder and storing the log likelihood ratio, wherein
the SISO decoder accesses the input data and the log likelihood ratio alternately in
a sequential order and in an interleaved order using the generated interleaved
addresses.

In some embodiments, the generated interleaved address is reused as a write
address for writing the log likelihood ratio output from the SISO decoder into the
LLR memory.

Embodiments of the present invention provide a turbo decoding system
comprising a processor for generating interleaved addresses and controlling hardware
blocks, an address queue for storing the generated interleaved addresses, a buffer
memory block including an LLR memory for storing a log likelihood ratio and a
plurality of memory blocks for storing soft inputs, and an SISO decoder connected to
the buffer memory block, the SISO decoder including an ACSA network for calculating
the log likelihood ratio recursively from the soft inputs and the log likelihood
ratio provided by the LLR memory, and a plurality of memory blocks for storing
intermediate results of the ACSA network.

In some embodiments, the processor prepares seed variables when the
interleaver structure changes due to a change of the coding standard or bit rate,
and generates the interleaved addresses column by column from the seed variables
using simple add and subtract operations when the interleaved addresses are required.

In some embodiments, the SISO decoder supports a Viterbi decoding mode. In the
Viterbi decoding mode, the ACSA network performs the Viterbi recursion, the LLR
memory stores traceback information output by the ACSA network, the processor
performs the traceback using the traceback information read from the LLR memory, and
one of the memories of the SISO decoder stores the path metrics output by the ACSA
network.

In some embodiments, the processor is a single-instruction and multiple-data
(SIMD) processor. Preferably, the SIMD processor includes five processing elements,
wherein one of the five processing elements controls the other four processing
elements, processes scalar operations, and fetches, decodes, and executes
instructions including control and multi-cycle scalar instructions, and wherein the
other four processing elements execute only SIMD instructions.

Embodiments of the present invention provide an interleaving method for
rearranging a data block in a data communication system. The interleaving method
comprises preparing seed variables and generating interleaved addresses on the fly
using the seed variables.

BRIEF DESCRIPTION OF THE DRAWINGS 
Other features of the present invention will be more readily understood from
the following detailed description of the invention when read in conjunction with the
accompanying drawings, in which:

Fig. 1 illustrates a basic turbo-encoding system. 

Fig. 2 illustrates a typical eight-state RSC encoder of Fig. 1. 

Fig. 3 shows a block diagram of the turbo decoding system.

Fig. 4 shows the extrinsic form of the turbo decoding system of Fig. 3.

Fig. 5 schematically shows a block diagram of a time-multiplexed turbo decoding
system according to an embodiment of the present invention.

Figs. 6A to 6C illustrate a simple example of a prunable block interleaver with
interleaver size N = 18 according to a conventional interleaving technique.

Figs. 7A and 7B illustrate a simple example of a prunable block interleaver with
interleaver size N = 18 according to the present invention.

Fig. 8 illustrates a more detailed block diagram of the turbo decoding system of
Fig. 5 in turbo decoding mode.

Fig. 9 illustrates a more detailed block diagram of the turbo decoding system of
Fig. 5 in Viterbi decoding mode.

Fig. 10 schematically shows the detailed ACSA network and related memory
blocks of Fig. 8.

Fig. 11 shows an ACSA unit contained in the ACSA A section 1022 of Fig. 10 for
calculating the forward metric $A_k(s)$.

Fig. 12 illustrates the detailed SIMD processor of Fig. 8.



DETAILED DESCRIPTION OF THE INVENTION 
The present invention will now be described more fully hereinafter with
reference to the accompanying drawings, in which typical embodiments of the
invention are shown. This invention may, however, be embodied in many different
forms and should not be construed as limited to the embodiments set forth herein.
Rather, these embodiments are provided so that this disclosure will be thorough and
complete, and will fully convey the scope of the invention to those skilled in the art.

Before the embodiments of the present invention are described, a typical turbo
coding system will be described with reference to Figs. 1 to 4 for a better
understanding of the present invention.

Fig. 1 illustrates a basic turbo-encoding system 100 and Fig. 2 illustrates a
typical eight-state RSC encoder 102, 106 of Fig. 1.

The encoder of the turbo coding system consists of two constituent systematic
encoders 102, 106 joined together by means of an interleaver 104. The input data
stream u is applied directly to the first encoder 102, and the interleaved version of
the input data stream u is applied to the second encoder 106. The systematic bits
(i.e., the original message bits) $x^{s}$ and the two sets of parity bits $x^{p1}$ and
$x^{p2}$ generated by the two encoders 102, 106 constitute the output of the turbo
encoding system 100 and are combined by a multiplexing means 108 to form a serial bit
stream that is transmitted over the communication channel. Before transmission,
puncturing may be performed if necessary.

The constituent encoders 102, 106 of the turbo encoding system are recursive
systematic convolutional (RSC) encoders, in which one or more of the tap outputs of
the shift register D1 to D3 are fed back to the input to obtain better performance of
the overall turbo coding strategy.
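
As a rough illustration of an eight-state RSC encoder with a three-stage shift register, the following Python sketch may help. The tap positions are an assumption: the text does not state them, so the sketch uses a feedback polynomial 1 + D^2 + D^3 and a feed-forward polynomial 1 + D + D^3, and the function name `rsc_encode` is hypothetical.

```python
def rsc_encode(bits):
    """Eight-state RSC encoder sketch.

    Assumed taps (not given in the text): feedback 1 + D^2 + D^3,
    feed-forward 1 + D + D^3. Returns (systematic, parity) bit lists.
    """
    d1 = d2 = d3 = 0                      # shift register D1..D3, initially zero
    systematic, parity = [], []
    for u in bits:
        fb = u ^ d2 ^ d3                  # register taps fed back to the input
        p = fb ^ d1 ^ d3                  # parity bit from the feed-forward taps
        systematic.append(u)              # systematic bit is the message bit itself
        parity.append(p)
        d1, d2, d3 = fb, d1, d2           # shift the register by one position
    return systematic, parity


systematic, parity = rsc_encode([1, 0, 0, 0])
```

Because of the feedback, a single nonzero input bit keeps producing nonzero parity bits, which is the recursive behaviour referred to above.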

Fig. 3 shows a block diagram of the turbo decoding system 300. The turbo
decoder 300 operates on a noisy version of the systematic bits $y^{s}$ and noisy
versions of the two sets of parity bits $y^{p1}$ and $y^{p2}$. The turbo decoding
system 300 uses iterative decoding algorithms and consists of interleaver 304 and
deinterleaver 308, 310 stages, individually matched to constituent decoding stages
302, 306. The systematic bits $y^{s}$ and the first set of parity bits $y^{p1}$ of the
turbo encoded data are applied to the first SISO (Soft-Input-Soft-Output) decoder
302. Additionally, the deinterleaved version of the metrics output of the second SISO
decoder 306 is fed back to the first SISO decoder 302. The metrics output of the first
SISO decoder 302 is applied to the second SISO decoder 306 via the interleaver 304.
The second set of parity bits $y^{p2}$ is applied to the second SISO decoder 306. The
output of the deinterleaver 310 is applied to a hard limiter 312, which outputs a bit
stream of decoded data $\hat{u}$ corresponding to the original raw data u.

As stated earlier, the output of the deinterleaver 308 is fed back to the metrics
input of the first SISO decoder 302. Thus the turbo decoding system 300 performs the
nth decoding iteration with input metrics resulting from the (n-1)th decoding
iteration. The total number of iterations is predetermined, or the iteration stops
when a certain stopping criterion is met.

Fig. 4 shows the extrinsic form of the turbo decoding system of Fig. 3, where I
stands for interleaver, D for deinterleaver, and SISO for soft-input soft-output
decoder, which may use the Log-MAP decoding algorithm, the Max-Log-MAP algorithm, etc.

The first decoding stage produces a soft estimate $\Lambda_1(u_k)$ of systematic
bit $u_k$ expressed as the log-likelihood ratio

$$\Lambda_1(u_k) = \log\frac{P(u_k = 1 \mid y^{s}, y^{p1}, \Lambda_{2e}(\mathbf{u}))}{P(u_k = 0 \mid y^{s}, y^{p1}, \Lambda_{2e}(\mathbf{u}))}, \quad k = 1, 2, \ldots, N \qquad (1)$$

where $y^{s}$ is the set of noisy systematic bits, $y^{p1}$ is the set of noisy parity
bits generated by the first encoder 102, and $\Lambda_{2e}(\mathbf{u})$ is the
extrinsic information about the set of message bits u derived from the second decoding
stage and fed back to the first stage.

Hence, the extrinsic information about the message bits derived from the first
decoding stage is

$$\Lambda_{1e}(\mathbf{u}) = \Lambda_1(\mathbf{u}) - \Lambda_{2e}(\mathbf{u}) \qquad (2)$$

where $\Lambda_{2e}(\mathbf{u})$ is to be defined.

Before application to the second decoding stage, the extrinsic information
$\Lambda_{1e}(\mathbf{u})$ is reordered to compensate for the interleaving introduced
in the turbo encoding system 100. In addition, the noisy parity bits $y^{p2}$
generated by the second encoder 106 are used as another input. Thus, by using the BCJR
algorithm, the second decoding stage produces a more refined soft estimate of the
message bits u.

This estimate is de-interleaved to produce the total log-likelihood ratio
$\Lambda_2(\mathbf{u})$. The extrinsic information $\Lambda_{2e}(\mathbf{u})$ fed back
to the first decoding stage is therefore

$$\Lambda_{2e}(\mathbf{u}) = \Lambda_2(\mathbf{u}) - \Lambda_{1e}(\mathbf{u}) \qquad (3)$$

where $\Lambda_{1e}(\mathbf{u})$ is itself defined by equation (2), and
$\Lambda_2(\mathbf{u})$ is the log-likelihood ratio computed by the second decoding
stage. Specifically, for the kth element of the vector u, we have

$$\Lambda_2(u_k) = \log\frac{P(u_k = 1 \mid y^{s}, y^{p2}, \Lambda_{1e}(\mathbf{u}))}{P(u_k = 0 \mid y^{s}, y^{p2}, \Lambda_{1e}(\mathbf{u}))}, \quad k = 1, 2, \ldots, N \qquad (4)$$

Through the application of $\Lambda_{2e}(\mathbf{u})$ to the first decoding stage,
the feedback loop around the pair of decoding stages is closed. Note that although the
set of noisy systematic bits $y^{s}$ is actually applied only to the first decoder 302
in Fig. 3, by formulating the information flow in the symmetric extrinsic manner
depicted in Fig. 4 we find that $y^{s}$ is, in fact, also applied to the second
decoding stage.

An estimate of the message bits u is computed by hard-limiting the log-likelihood
ratio $\Lambda_2(\mathbf{u})$ at the output of the second decoding stage, as shown by
$\hat{\mathbf{u}} = \mathrm{sgn}(\Lambda_2(\mathbf{u}))$, where the signum function
operates on each element of $\Lambda_2(\mathbf{u})$ individually. To initiate the
turbo decoding algorithm, we simply set $\Lambda_{2e}(\mathbf{u}) = 0$ on the first
iteration of the algorithm.
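
The bookkeeping of equations (2) and (3) and the final hard limiting can be sketched as follows. This is only an outline under stated assumptions: `stage1` and `stage2` are hypothetical callables standing in for the two SISO decoding stages (interleaving and deinterleaving are assumed to happen inside `stage2`), and the toy stages in the usage lines simply add a fixed channel value to the prior, which is not a real BCJR decoder. The value returned for an LLR of exactly zero is also an arbitrary choice.

```python
def turbo_iterations(stage1, stage2, n_bits, n_iter):
    """Exchange extrinsic information between two stages, then hard-limit.

    stage1/stage2 are hypothetical stand-ins: each maps the extrinsic
    input from the other stage to a total log-likelihood ratio vector.
    n_iter must be at least 1.
    """
    lam_2e = [0.0] * n_bits                               # Lambda_2e(u) = 0 initially
    for _ in range(n_iter):
        lam_1 = stage1(lam_2e)                            # total LLR of stage 1
        lam_1e = [a - b for a, b in zip(lam_1, lam_2e)]   # equation (2)
        lam_2 = stage2(lam_1e)                            # total LLR of stage 2
        lam_2e = [a - b for a, b in zip(lam_2, lam_1e)]   # equation (3)
    return [1 if v > 0 else -1 for v in lam_2]            # element-wise signum


# Toy stages (assumption): total LLR = prior + fixed channel value.
channel = [0.5, -1.0, 2.0]
toy_stage = lambda prior: [p + c for p, c in zip(prior, channel)]
decisions = turbo_iterations(toy_stage, toy_stage, n_bits=3, n_iter=2)
```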

The turbo decoding system of the present invention will now be described.
Embodiments of the present invention provide a multi-standard turbo decoding system
with a processor on which a software interleaver is run. In particular, the present
invention provides a turbo decoding system with a configurable hardware SISO decoder
and a programmable single-instruction and multiple-data (SIMD) processor that performs
flexible tasks such as interleaving. The software turbo interleaver runs on the SIMD
processor. The decoding system of the present invention can also support the Viterbi
decoding algorithm as well as turbo decoding algorithms such as the Log-MAP algorithm
and the Max-Log-MAP algorithm.

The processor generates interleaved addresses for the turbo decoder, supporting
multiple 3G wireless standards at the speed of the hardware SISO decoder and changing
the interleaver structure (i.e., frame size and bit rate) in a very short time with a
small memory. To hide the timing overhead of an interleaver change, the generation of
interleaved addresses is split into two parts: pre-processing and incremental
on-the-fly generation. The pre-processing part prepares a small number of seed
variables, and the incremental on-the-fly generation part generates interleaved
addresses from the seed variables on the fly. When the bit rate changes, the processor
carries out only the pre-processing part to prepare a small number of seed variables,
which requires little time and memory. Whenever the interleaved address sequence is
required, the processor generates the interleaved addresses using the seed variables.
This splitting method reduces the timing overhead of the interleaver and requires only
a small memory to save the seed variables.

Fig. 5 schematically shows a block diagram of a time-multiplexed turbo decoding
system according to an embodiment of the present invention.

The turbo decoding system 500 of the present invention comprises an SISO
decoder 502, a processor 504 on which a software interleaver is run, and a
$\Lambda_e$ memory 506 for storing an extrinsic log-likelihood ratio (LLR), as
illustrated in Fig. 5. Data are stored sequentially in the $\Lambda_e$ memory 506, as
they are always read and written in place. For each iteration, data are accessed in a
sequential order for the odd-numbered SISO decoding and in an interleaved order for
the even-numbered SISO decoding. Namely, in the odd-numbered (first) SISO decoding,
the SISO decoder 502 receives data in a sequential order from the $\Lambda_e$ memory
506 and calculates the log-likelihood ratio, and the log-likelihood ratio is written
into the $\Lambda_e$ memory 506 in the sequential order. In the even-numbered (second)
SISO decoding, the SISO decoder 502 receives data in an interleaved order from the
$\Lambda_e$ memory 506 and calculates a new log-likelihood ratio, and the new
log-likelihood ratio is deinterleaved with the help of the address queue 508 and
written into the $\Lambda_e$ memory 506 in the sequential order. As shown with the
dotted lines, the processor 504 provides the interleaved addresses used to read data
in an interleaved order. The address queue 508 saves the addresses in the interleaved
order so that they can be used as the write addresses for the $\Lambda_e$ memory 506
when the SISO decoder 502 produces results after its latency. Accordingly, data in an
interleaved order are deinterleaved into the sequential order and saved in the
$\Lambda_e$ memory 506 in a sequential order.
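
The reuse of read addresses as write addresses can be sketched with a FIFO, as below. This is a simplified model under stated assumptions: the function name `even_half_iteration` is hypothetical, the SISO decoder is replaced by an element-wise stand-in (`siso`), its latency is modelled as a fixed number of in-flight results, and the address list is an arbitrary permutation of 0 to 17 chosen for illustration.

```python
from collections import deque


def even_half_iteration(llr_mem, addresses, siso, latency):
    """Read llr_mem in interleaved order and write results back in place.

    Each read address is queued and later reused as the write address,
    so the results end up stored in sequential (deinterleaved) order.
    """
    address_queue = deque()    # read addresses awaiting reuse as write addresses
    in_flight = deque()        # results still inside the modelled SISO pipeline
    for addr in addresses:
        address_queue.append(addr)
        in_flight.append(siso(llr_mem[addr]))
        if len(in_flight) > latency:                    # oldest result emerges
            llr_mem[address_queue.popleft()] = in_flight.popleft()
    while in_flight:                                    # drain the pipeline
        llr_mem[address_queue.popleft()] = in_flight.popleft()
    return llr_mem


addresses = [17, 1, 13, 7, 2, 11, 9, 16, 3, 14, 6, 4, 12, 8, 15, 0, 10, 5]
memory = [float(i) for i in range(18)]
even_half_iteration(memory, addresses, siso=lambda v: 2.0 * v, latency=4)
```

Since every address is read exactly once before it is written, the in-place update does not overwrite data that is still needed.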

In addition to interleaving, the processor 504 can control the hardware blocks
or interface with an external host, and it processes the trellis termination and a
stopping criterion during the first SISO decoding, which does not need interleaved
addresses. In Viterbi decoding mode, the SISO decoder 502 is used repeatedly for the
Viterbi recursion. The $\Lambda_e$ memory 506 plays the role of the traceback memory.
The processor 504 performs the traceback using the traceback information read from the
$\Lambda_e$ memory 506.

The flexible software turbo interleaver run on the processor for the turbo decoder
will now be described. To hide the timing overhead of an interleaver change, the
generation of interleaved addresses is split into two parts: pre-processing and
incremental on-the-fly generation. The pre-processing part prepares a small number of
seed variables, and the incremental on-the-fly generation part generates interleaved
addresses from the seed variables on the fly. When the interleaver size changes due to
a change of the bit rate or of the communication standard itself, the pre-processing
part prepares only a small number of seed variables, not the whole interleaved address
sequence. Through parallel processing using the seed variables, the processor
generates interleaved addresses as fast as the hardware SISO decoding rate whenever
the interleaved address sequence is required. The unit of on-the-fly address
generation is a column of a block interleaver. Interleaved addresses are generated
column by column, used as read addresses and stored in the address queue, and then
reused as write addresses for deinterleaving.

Before the interleaving technique according to the embodiment of the present
invention is described, conventional interleaving techniques will be described for a
better understanding of the present invention.

The turbo decoders for wireless systems are based on block turbo interleavers.
Although the operations and parameters vary depending on the standard, W-CDMA and
cdma2000 share the prunable block interleaver structure, in which the interleaver is
implemented by building a mother interleaver of a predefined size and then pruning
unnecessary addresses. The mother interleavers can be viewed as two-dimensional
matrices, where the entries are written into the matrix row by row and read out
column by column. Before the entries are read out, intra-row and inter-row
permutations are performed.

Figs. 6A to 6C illustrate a simple example of a typical prunable block
interleaver with interleaver size N = 18, where the data indexes are written in matrix
form. Incoming data are written into memory as a two-dimensional matrix row by row,
as shown in Fig. 6A. Fig. 6B shows the data indexes of Fig. 6A after intra-row
permutation. The intra-row permutation rule applied to this example is

$$y_{ij} = b_i + [(j+1) \cdot q_i] \bmod 5 \qquad (5)$$

where $y_{ij}$ is the permuted index of the ith row and jth column, i and j are the
row and column indexes, respectively, $b = (b_0, b_1, b_2, b_3) = (0, 5, 10, 15)$, and
$q = (q_0, q_1, q_2, q_3) = (1, 2, 3, 7)$. Fig. 6C shows the result of the inter-row
permutation applied to Fig. 6B, which will be read out from the memory column by
column as the sequence 17, 1, 13, 7, 2, 11, 9, ..., 0, 10, 5. The indexes 19 and 18,
which exceed the range of interest, are pruned.
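
The conventional procedure in this example can be reproduced with a short Python sketch. The function name `conventional_interleave` is hypothetical, and the inter-row order (3, 0, 2, 1) is taken from the stored order b3, b0, b2, b1 described for Fig. 7A; the sketch builds the whole permuted matrix first, which is exactly the cost the invention seeks to avoid.

```python
def conventional_interleave(n, b, q, row_order, cols):
    """Build the full permuted matrix, then read it out column by column."""
    # Intra-row permutation, equation (5): y_ij = b_i + [(j + 1) * q_i] mod cols
    matrix = [[b[i] + ((j + 1) * q[i]) % cols for j in range(cols)]
              for i in range(len(b))]
    # Inter-row permutation: reorder whole rows
    permuted = [matrix[i] for i in row_order]
    # Column-by-column read-out, pruning indexes that are not below n
    return [row[j] for j in range(cols) for row in permuted if row[j] < n]


addresses = conventional_interleave(
    n=18, b=(0, 5, 10, 15), q=(1, 2, 3, 7), row_order=(3, 0, 2, 1), cols=5)
```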

The interleaving technique according to an embodiment of the present invention
will now be described using the example of Figs. 6A to 6C and with reference to
Figs. 7A and 7B.

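The present embodiment uses an increment vector w with $w_i = q_i \bmod 5$ instead
of q, and a cumulative vector $\mathbf{x}_j$ with

$$x_{ij} = [(j+1) \cdot q_i] \bmod 5 \qquad (6)$$

Using equation (6), equation (5) can be rewritten as

$$y_{ij} = b_i + x_{ij} \qquad (7)$$

and $x_{ij}$ can be obtained recursively as

$$x_{ij} = [(j+1) \cdot (q_i \bmod 5)] \bmod 5 = [(j+1) \cdot w_i] \bmod 5 = [(j \cdot w_i \bmod 5) + w_i] \bmod 5 = (x_{i,j-1} + w_i) \bmod 5 \qquad (8)$$

where j = 1, 2, 3, 4, 5 and $\mathbf{x}_0 = \mathbf{w}$.
As $0 \le x_{i,j-1} < 5$ and $0 \le w_i < 5$, $0 \le x_{i,j-1} + w_i < 10$ and thus

$$x_{ij} = \begin{cases} x_{i,j-1} + w_i - 5 & \text{if } x_{i,j-1} + w_i \ge 5 \\ x_{i,j-1} + w_i & \text{otherwise} \end{cases} \qquad (9)$$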

According to the embodiments of the present invention, the multiplication and
modulo operations of equation (6) are replaced by cheaper operations: the
multiplication by an addition, and the modulo by a comparison and a subtraction.
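
A quick numerical check of this replacement, written as a sketch with hypothetical helper names (`x_direct`, `x_recursive`), compares the multiply-and-modulo form of equation (6) with the add, compare, and subtract form of equation (9) for the increments used in the example.

```python
def x_direct(q_i, j, m=5):
    """Equation (6): one multiplication and one modulo per entry."""
    return ((j + 1) * q_i) % m


def x_recursive(q_i, cols, m=5):
    """Equation (9): one addition and one compare-and-subtract per entry."""
    w_i = q_i % m            # increment, computed once in pre-processing
    x = w_i                  # x_{i,0} = w_i
    out = [x]
    for _ in range(1, cols):
        x = x + w_i
        if x >= m:           # 0 <= x < 2m, so one subtraction is enough
            x -= m
        out.append(x)
    return out


for q_i in (1, 2, 3, 7):
    assert x_recursive(q_i, 5) == [x_direct(q_i, j) for j in range(5)]
```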

As shown in Fig. 7A, b, w, and $\mathbf{x}_0$ for the first column of the block
interleaver are calculated and stored in vector registers of the processor in the
preprocessing stage. The number of elements of each of the vectors b, w, and
$\mathbf{x}_0$ corresponds to the number of elements of a column. The right side of
Fig. 7A shows that b, w, and $\mathbf{x}_0$ are stored in advance in the order of the
inter-row permutation, i.e., $b = (b_3, b_0, b_2, b_1) = (15, 0, 10, 5)$ and
$w = (w_3, w_0, w_2, w_1) = (2, 1, 3, 2) = \mathbf{x}_0$, so as to free the on-the-fly
generation from the inter-row permutation.

In the column-by-column on-the-fly address generation stage shown in Fig. 7B,
the processor updates $x_{ij}$ according to equation (9) and calculates the addresses
based on equation (7). Calculated addresses are sent to the address queue if they are
smaller than the interleaver size N.

Referring to Fig. 7B, the interleaved addresses for the first column ($y_{i,1}$)
are calculated as $\mathbf{b} + \mathbf{x}_0$. Since $\mathbf{x}_0$ is w, $y_{i,1}$ is
calculated as b + w = (15, 0, 10, 5) + (2, 1, 3, 2) = (17, 1, 13, 7). After the
interleaved addresses for the first column are calculated, $\mathbf{x}_1$ is obtained
by adding $\mathbf{x}_0$ = (2, 1, 3, 2) and w = (2, 1, 3, 2). Thereby $\mathbf{x}_1$ is
set to (4, 2, 1, 4), where the third element of $\mathbf{x}_1$ is 1 because (3+3) is
not smaller than 5 and 5 is therefore subtracted. Accordingly, the interleaved
addresses for the second column ($y_{i,2}$) are calculated as
$\mathbf{b} + \mathbf{x}_1$ = (15, 0, 10, 5) + (4, 2, 1, 4) = (19, 2, 11, 9), where the
first element 19 is discarded since it is larger than or equal to the size N (= 18).
As described above, the present invention requires only a small memory for
storing the seed variables and performs add and subtract operations instead of modulo
and multiplication operations.
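
The two-part scheme for the N = 18 example can be sketched as follows. The function names `preprocess` and `generate` are hypothetical, and plain Python lists stand in for the vector registers; the sketch is meant only to show that the seed vectors plus add, compare, and subtract operations suffice to produce the interleaved sequence.

```python
def preprocess(b, q, row_order, m):
    """Pre-processing: seed vectors stored in inter-row-permuted order."""
    base = [b[i] for i in row_order]         # base column vector b
    inc = [q[i] % m for i in row_order]      # increment column vector w
    cum = list(inc)                          # cumulative vector, x_0 = w
    return base, inc, cum


def generate(n, base, inc, cum, cols, m):
    """On-the-fly generation, one column at a time."""
    cum = list(cum)
    for _ in range(cols):
        for b_i, x_i in zip(base, cum):
            y = b_i + x_i                    # equation (7)
            if y < n:                        # pruning
                yield y
        # equation (9): add, then subtract m when the sum reaches m
        cum = [x + w - m if x + w >= m else x + w for x, w in zip(cum, inc)]


base, inc, cum = preprocess((0, 5, 10, 15), (1, 2, 3, 7), (3, 0, 2, 1), m=5)
addresses = list(generate(18, base, inc, cum, cols=5, m=5))
```

Only the three four-element seed vectors are stored; the 18 addresses themselves are never held in a table.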






The above example is a very simple one, and real-world turbo interleavers such as
the W-CDMA and cdma2000 standard turbo interleavers are much more complex. However,
the fundamental structure and the basic operations used in the permutation rules are
the same. The pseudocodes of the on-the-fly address generation for W-CDMA and cdma2000
are shown in Table 1 and Table 2, respectively.

In Table 1, C is the number of columns, R is the number of rows, p is a prime
number, and s(x) is a permutation sequence, all of which are determined from the
interleaver size N according to the specification of the W-CDMA standard. Likewise,
C, R, and the binary power $2^n$ in Table 2 are determined from N according to the
cdma2000 specification. The present invention also handles these values as seed
variables and calculates them in advance at the preprocessing stage.

The on-the-fly generation flows of these real-world turbo interleavers are
similar to the example. They also have a base column vector b, an increment column
vector w, a cumulative column vector x, and a certain modulo base. x is updated by
adding w to the old value of x. If an element of the updated x is not smaller than the
modulo base, the modulo base is subtracted from that element. This operation
substitutes for a computationally expensive modulo operation. Then the interleaved
addresses for one column are generated by adding b and a vector that is calculated
from x.

Table 1

0  column_counter = C-1
   loop:
1  x = x + w
2  for each (i = 0, 1, ..., R-1) if (x_i >= p-1) x_i = x_i - (p-1)
3  load s(x) from the data memory
4  y = b + s(x)
5  for each (i = 0, 1, ..., R-1) if (y_i < N) send y_i to the address queue
6  if ((column_counter--) != 0) goto loop

Table 2

0  column_counter = C-1
   loop:
1  x = x + w
2  for each (i = 0, 1, ..., R-1) if (x_i >= 2^n) x_i = x_i - 2^n
3  y = b + x
4  for each (i = 0, 1, ..., R-1) if (y_i < N) send y_i to the address queue
5  if ((column_counter--) != 0) goto loop



A SIMD (single-instruction and multiple-data) processor is suitable for this
operation because the column is a vector and all the column entries go through
exactly the same operations. However, in order to generate one address every one or
two cycles, some special instructions that shorten the program loop can be very
helpful.

To speed up address generation, some customized instructions can be used to
reduce the length of the loop in the on-the-fly generation part. The present invention
introduces three processor instructions: STOLT (store to output port if less than),
SUBGE (subtract if greater or equal), and LOOP. Each of them substitutes for a
sequence of three ordinary instructions but takes only one clock cycle to execute. For
example, the instruction STOLT corresponds to the three typical RISC instructions
SUB x, y, z; BRANCH if z >= 0; STO x. Likewise, the SIMD instruction SUBGE corresponds
to the three RISC instructions SUB x, y, z; BRANCH if z < 0; MOVE z, x.

Pruning can be mapped to STOLT. The function of STOLT is to send the 
calculated interleaved address to the address queue only if calculated interleaved 
address is smaller than N, which is needed for the pruning as in line 5 of the 
pseudocode of Table 1 and line 4 of Table 2. 

Another conditional instruction, SUBGE, is quite useful for block interleavers,
which commonly use modulo operations. The instruction SUBGE substitutes for a modulo
or remainder operation a mod b if the condition 0 <= a < 2b is satisfied, which
corresponds to equation (9) and line 2 of Table 1 and Table 2.
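
The behaviour of the two conditional instructions can be modelled in a few lines. This is a functional sketch only: the Python functions `subge` and `stolt` are hypothetical scalar models of the instructions, and a list stands in for the output port to the address queue.

```python
def subge(a, b):
    """SUBGE model: subtract b from a only if a is greater than or equal to b."""
    return a - b if a >= b else a


def stolt(value, limit, queue):
    """STOLT model: store value to the output queue only if it is below limit."""
    if value < limit:
        queue.append(value)


# SUBGE equals a mod b whenever 0 <= a < 2b
b = 5
assert all(subge(a, b) == a % b for a in range(2 * b))

# STOLT performs the pruning step for N = 18
queue = []
for address in (19, 2, 11, 9):
    stolt(address, 18, queue)
```

Outside the range 0 <= a < 2b the model no longer matches the modulo, for example `subge(12, 5)` gives 7, which is why the precondition matters.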

The LOOP instruction, which has been adopted in several DSP processors to reduce
loop overhead, is also helpful for address generation in our application. This
instruction corresponds to a sequence of CMP, BNE (branch if not equal), and SUB
instructions; it decrements the loop count and branches at once.

Using these special instructions, the present invention can reduce the lengths
of the on-the-fly generation program loops of W-CDMA, cdma2000, and CCSDS to six,
five, and four instructions, respectively. Using the loop-unrolling technique, the
present invention can further shorten the loop length of the on-the-fly address
generation parts by almost one instruction.

In the turbo interleaver pseudocodes of Table 1 and Table 2, each line
corresponds to an instruction of the SIMD processor code. In Table 1, line 2
corresponds to SUBGE, line 5 to STOLT, and line 6 to LOOP. SUBGE safely substitutes
for $x_i = x_i \bmod (p-1)$ because the condition $0 \le x_i < 2(p-1)$ is satisfied
($0 \le x_i < p-1$ and $0 \le w_i < p-1$ before they are added). If R = 10 or 20 and
the processor can process five data at once, lines 1-5 are repeated twice or four
times to produce an entire column of the interleaver matrix. Similarly, in Table 2,
line 2 corresponds to SUBGE, line 4 to STOLT, and line 5 to LOOP.
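
The repetition of lines 1-5 of Table 1 for a column taller than the SIMD width can be sketched as below. All numeric values are made up for illustration (p = 7, a six-entry table `s`, R = 10, and the base and increment vectors) and are not taken from the W-CDMA specification; the function name `column_pass` and the constant `LANES` are likewise hypothetical.

```python
LANES = 5  # assumed number of data items processed at once


def column_pass(base, inc, cum, s, p, n):
    """One column of Table 1, processed LANES rows at a time.

    cum is updated in place; the pruned addresses of the column are returned.
    """
    out = []
    for start in range(0, len(base), LANES):     # twice for R = 10
        rows = slice(start, start + LANES)
        x = [xi + wi for xi, wi in zip(cum[rows], inc[rows])]       # line 1
        x = [xi - (p - 1) if xi >= p - 1 else xi for xi in x]       # line 2 (SUBGE)
        cum[rows] = x
        y = [bi + s[xi] for bi, xi in zip(base[rows], x)]           # lines 3-4
        out.extend(yi for yi in y if yi < n)                        # line 5 (STOLT)
    return out


# Illustrative values only
p = 7
s = [1, 3, 2, 6, 4, 5]                 # hypothetical permutation table s(x)
base = [7 * r for r in range(10)]      # R = 10 rows
inc = [1, 5] * 5
cum = [0] * 10
first_column = column_pass(base, inc, cum, s, p, n=66)
```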

Fig. 8 illustrates a block diagram of a decoding system that can support turbo
decoding and Viterbi decoding. Fig. 8 schematically shows the data flow and address
flow in turbo decoding mode. In Fig. 8, solid lines indicate data flow and dotted
lines indicate address flow. Fig. 9 schematically shows the data flow and address flow
in Viterbi decoding mode.

Referring to Fig. 8, the turbo decoding system of the present invention comprises
an SISO decoder 810, a processor block 830, a buffer memory block 850 and an address
queue 870. The processor block 830 includes a SIMD (single-instruction and
multiple-data) processor 832, an instruction memory 834, and a data memory 836. The
SISO decoder 810 implements a typical sliding-window decoder. It includes an ACSA
(add-compare-select-add) network 812 and a plurality of memory blocks: T1 memory 814,
T2 memory 816, T3 memory 818, A memory 820 and hard decision memory 822. The buffer
memory block 850 includes a $\Lambda_e$ memory 852 for storing an extrinsic
log-likelihood ratio (LLR) and a plurality of memories for storing soft input data: a
$y^{s}$ memory 854 for storing noisy systematic bits multiplied by the channel
reliability 862, and $y^{p1}$ to $y^{p3}$ memories 856, 858, 860 for storing noisy
parity bits multiplied by the channel reliability 862.

The ACSA network 812 calculates forward metrics A_k, backward metrics B_k, and 
the extrinsic log-likelihood ratio (LLR). The memory blocks Γ1 memory 814, Γ2 memory 
816, and Γ3 memory 818 store input data, and the A memory 820 temporarily stores 
the calculated forward metrics. The hard decision output of the ACSA network 812 is 
stored in the hard decision memory 822. The SIMD processor 832 also calculates a 
stopping criterion during SISO decoding from information stored in the hard decision 
memory 822. Input data are read into one of the memory blocks Γ1 memory 814, Γ2 
memory 816, and Γ3 memory 818 and used three times: for calculating the forward 
metric A_k, the backward metric B_k, and the LLR Λ(u_k). 

A software interleaver runs on the SIMD processor 832. As described earlier, 
the SIMD processor 832 generates interleaved addresses column by column. When the 
SIMD processor 832 calculates interleaved read addresses, the address queue 870, 
whose length equals the SISO latency, saves them so that they can be used again as 
the write addresses into the Λ_e memory 852. Namely, when the ACSA network 812 
produces results using the data read from the interleaved addresses, the results are 
stored into the corresponding locations of the Λ_e memory 852 using the write 
addresses stored in the address queue 870. In addition to the interleaving, the SIMD 
processor 832 controls the hardware blocks, interfaces with an external host, and 
processes the trellis termination and a stopping criterion during SISO decoding, 
tasks that do not need an interleaver. 
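
The reuse of read addresses as write addresses can be illustrated with a minimal Python sketch. The SISO decoder is modeled here only as a fixed-latency pipeline that transforms each value, and the function name `run_with_address_queue`, the `siso` callback, and the example data are hypothetical; the sketch shows the queueing idea, not the actual hardware.

```python
from collections import deque


def run_with_address_queue(memory, read_addresses, latency, siso=lambda v: v + 1):
    """Read memory at interleaved addresses, pass the values through a
    fixed-latency pipeline, and write each result back to the address it
    was read from, using a FIFO of at most `latency` saved addresses."""
    if latency < 1:
        raise ValueError("latency must be at least 1")
    queue = deque()     # read addresses waiting to be reused as write addresses
    pipeline = deque()  # values currently inside the modeled SISO pipeline
    for addr in read_addresses:
        if len(pipeline) == latency:
            # The oldest value leaves the pipeline; its address leaves the queue.
            memory[queue.popleft()] = siso(pipeline.popleft())
        queue.append(addr)
        pipeline.append(memory[addr])
    while pipeline:     # drain the values still in flight
        memory[queue.popleft()] = siso(pipeline.popleft())
    return memory


if __name__ == "__main__":
    print(run_with_address_queue([10, 20, 30, 40], [2, 0, 3, 1], latency=2))
```

Because the queue and the pipeline advance together, the queue never holds more than `latency` addresses, which matches the statement that the queue length is the SISO latency.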

Since the Viterbi algorithm does not calculate backward metrics, some components 
of Fig. 8 are unused in Viterbi decoding mode, as shown in Fig. 9. In Fig. 9, the 
components illustrated by dotted lines, such as the address queue 970, the channel 
reliability multiplication 962, and the three Γ memory blocks 914, 916, 918, are not 
used in Viterbi decoding mode. The Λ_e memory 852 of Fig. 8 serves as the traceback 
memory 952, and the A memory 820 of Fig. 8 serves as the path metric memory. The SIMD 
processor 932 processes the traceback from the traceback information read from the 
traceback memory 952. The SISO network 912 is used for the Viterbi forward trellis 
recursion. 

Fig. 10 schematically shows the detailed ACSA network and related memory 
blocks of Fig. 8. The SISO decoder 1000 in Fig. 10 implements the sliding window 
turbo decoding technique in hardware. The ACSA network 1010 includes multiplexers 
1012, 1014, 1016, 1018, 1020, an ACSA A section 1022, two ACSA B sections 1024, 1026, 
and an ACSA Λ section 1028. ACSA A 1022 and ACSA B 1024 each comprise eight ACSA 
units, and ACSA Λ 1028 comprises fourteen ACSA units. An exemplary ACSA unit of 
ACSA A 1022 is illustrated in Fig. 11. 

After input data sequences are stored in the Γ1 memory 1050 and the Γ2 memory 
1070, SISO decoding starts while new interleaved data sequences are stored in the Γ3 
memory 1090. 

To implement the sliding window, first a window of L input data is stored in 
the Γ memories 1050, 1070, 1090. According to the MAP algorithm, this block of 
written values is read three times. These three operations are performed in parallel 
in the ACSA sections in Fig. 10. The ACSA B section 1026 implements the sliding 
window algorithm, which requires a dummy backward recursion of depth L. In order to 
avoid the use of a multiple-port memory, three separate Γ memories of depth L are 
used so that the defined operations operate on them in a cyclic way. The A memory 
1030 temporarily stores the calculated A_k(s) values. 

SISO outputs are obtained in the reversed order, but the correct order can be 
restored by properly changing the interleaving law of the decoder. 

To support multiple standards, configurable ACSA units can be employed in 
the SISO decoder in Fig. 10. The ACSA unit shown in Fig. 11 can be adapted to 
various RSC codes of different constraint lengths K, different coding rates of 1/2 
to 1/5, and/or arbitrary generator polynomials. 

The Log-MAP algorithm outperforms the Max-Log-MAP algorithm if the 
channel noise is properly estimated. However, it is reported that the Max-Log-MAP 
algorithm is more tolerant of channel estimation error than the Log-MAP algorithm. 
Thus the present invention provides an ACSA unit that can be selected to use either 
the Log-MAP or the Max-Log-MAP algorithm. 

Fig. 11 shows an ACSA unit contained in the ACSA A section 1022 of Fig. 10 for 
calculating the forward metric A_k(s). 

The forward metric A_k(s) is calculated by equation (10) 

\[
A_k(s) = \ln\Big( \sum_{s'} \exp\big[ A_{k-1}(s') + \Gamma_k(s', s) \big] \Big)
\approx \max_{s'} \big( A_{k-1}(s') + \Gamma_k(s', s) \big) \qquad (10)
\]

where A_{k-1}(s') is the forward metric of state s' at the previous time stamp k-1 
and Γ_k(s', s) is the logarithmic branch probability that the trellis state changes 
from s' to s at time stamp k. The ACSA unit 1100 includes input multiplexers 1101, 
1103, 1105, 1107, 1109, 1111, two adder blocks (CSA) 1113, 1115, two input adders 
1117, 1119, a comparator 1121, a lookup table 1123, an output multiplexer 1125, and 
an output adder 1127. 

The two adder blocks (CSA) 1113, 1115 calculate the branch probabilities 
Γ_k(s', s) given by equation (11) 

\[
\Gamma_k(s', s) = \frac{1}{2}\Big( x_k^{s}\big[ \Lambda_e(u_k) + L_c y_k^{s} \big]
+ L_c \sum_{i} x_k^{pi} y_k^{pi} \Big) \qquad (11)
\]

where Λ_e(u_k) is the extrinsic LLR information from the previous SISO 
decoding and L_c is the channel reliability. The input data Λ_e(u_k) + L_c y_k^s, 
L_c y_k^p0, and L_c y_k^p1 are selected by the multiplexers 1101, 1103, 1105, 1107, 
1109, 1111, which can change the coding rate and the transfer function. The input 
adder 1117 adds the output branch probability Γ_k(s_0', s) of the adder block 1113 
and the incoming data A_{k-1}(s_0'). The input adder 1119 adds the output branch 
probability Γ_k(s_1', s) of the adder block 1115 and the incoming data A_{k-1}(s_1'). 
The comparator 1121 receives the two inputs from the input adders 1117, 1119 and 
outputs the maximum of the two inputs and the difference between them. The 
difference is used to look up the table 1123, which stores the approximation offsets 
of (10), and the maximum is transferred to the output adder 1127. The output 
multiplexer 1125 selects the decoding algorithm, Log-MAP or Max-Log-MAP. If 0 is 
selected at the multiplexer 1125, the output A_k(s) of the adder 1127 is given as 
A_k(s) = max(A_{k-1}(s') + Γ_k(s', s)), which corresponds to the Max-Log-MAP 
algorithm. Otherwise, if the lookup table value is selected, the output A_k(s) of 
the adder 1127 is given as A_k(s) = ln(Σ exp[A_{k-1}(s') + Γ_k(s', s)]), which is 
compensated by the offset read from the lookup table and corresponds to the Log-MAP 
algorithm. 
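
The add-compare-select-add data path described above can be sketched in Python for one pair of incoming states. This is only an illustration under stated assumptions: the table step of 0.5, the table size of 8 entries, and the names `OFFSET_TABLE` and `acsa` are hypothetical and are not taken from Fig. 11; the offset stored in the table is the usual correction term ln(1 + exp(-d)) for a difference d between the two sums.

```python
import math

TABLE_STEP = 0.5   # assumed quantization step of the difference
TABLE_SIZE = 8     # assumed number of table entries
OFFSET_TABLE = [math.log1p(math.exp(-i * TABLE_STEP)) for i in range(TABLE_SIZE)]


def acsa(a0, g0, a1, g1, use_log_map):
    """One ACSA step: add, compare, select the maximum, add an offset.

    a0, a1: incoming forward metrics of the two previous states
    g0, g1: branch metrics of the two transitions
    use_log_map: True passes the table offset, False passes 0 (Max-Log-MAP)
    """
    s0 = a0 + g0                      # input adder for the first branch
    s1 = a1 + g1                      # input adder for the second branch
    largest = max(s0, s1)             # comparator: maximum
    diff = abs(s0 - s1)               # comparator: difference
    offset = 0.0
    if use_log_map:
        index = int(diff / TABLE_STEP)
        if index < TABLE_SIZE:        # beyond the table the offset is treated as 0
            offset = OFFSET_TABLE[index]
    return largest + offset           # output adder


if __name__ == "__main__":
    print(acsa(1.0, 0.5, 0.0, 0.5, use_log_map=False))
    print(acsa(1.0, 0.5, 0.0, 0.5, use_log_map=True))
```

With the offset forced to 0 the result is the plain maximum, and with the table offset it approaches ln(exp(s0) + exp(s1)); the accuracy of the second mode depends on the assumed step and table size.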

In conventional hardware SISO decoders, the calculation of Γ_k(s', s) is fixed, 
as all the x_k values are fixed for the target turbo code. However, the present 
invention can change the x_k values in the equation by configuring the input 
multiplexers in Fig. 11. This allows the coding rate and the generator polynomials 
of the RSC encoder to be changed. The ACSA unit in the figure can support turbo 
codes of rate 1/2 to 1/5 with arbitrary generator polynomials. To support lower 
coding rates, the number of inputs to the Γ_k(s', s) calculation logic should be 
increased. To support multiple constraint lengths K, the number of ACSA units and 
the interconnection between the units can be changed. As mentioned above, the 
multiplexer 1125 on the right determines the decoding algorithm: Log-MAP or 
Max-Log-MAP. If a reliable channel estimate is obtained from an external host, 
which calculates it using, for example, the power control bits of 3G communication 
systems, better performance can be obtained with the Log-MAP algorithm by setting 
the multiplexer to pass the look-up table value. On the other hand, if nothing is 
known about the channel, the Max-Log-MAP algorithm is used to avoid errors due to 
channel misestimation by passing 0 to the final adder of the ACSA unit. 
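
The effect of configuring the x_k values can be illustrated with a short Python sketch. It assumes the commonly used log-domain branch metric in which each code bit is represented as +1 or -1 and the metric is half the signed sum of the selected inputs; this form, the function name `branch_metric`, and the example values are assumptions for illustration and are not taken from Fig. 11.

```python
def branch_metric(x_sys, x_par, sys_input, par_inputs):
    """Signed sum of the selected inputs for one trellis transition.

    x_sys: +1 or -1, systematic code bit of the transition
    x_par: sequence of +1/-1 parity code bits of the transition
    sys_input: extrinsic LLR plus scaled systematic sample
    par_inputs: scaled parity samples, one per parity bit
    """
    if len(x_par) != len(par_inputs):
        raise ValueError("one parity sample is needed per parity bit")
    total = x_sys * sys_input
    for sign, value in zip(x_par, par_inputs):
        total += sign * value
    return 0.5 * total


if __name__ == "__main__":
    # Same received samples, two different transition labels.
    print(branch_metric(+1, (+1, -1), 2.0, (1.0, 0.5)))
    print(branch_metric(-1, (+1,), 2.0, (1.0,)))
```

Changing the tuple `x_par` corresponds to selecting a different generator polynomial or a different number of parity inputs, which is the role the text assigns to the input multiplexers.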

To keep pace with the hardware SISO described with reference to Figs. 10 and 
11, it is preferable that interleaved addresses be generated by parallel processing. 
The present invention employs a SIMD architecture because it is suitable for simple 
and repetitive address generation and has simpler control and lower power 
consumption than a superscalar or very long instruction word (VLIW) processor. 
Fig. 12 illustrates the detailed SIMD processor of Fig. 8. In Fig. 12, dotted lines 
indicate control flow and solid lines indicate data flow. Considering that the 
number of rows of the W-CDMA block interleaver is a multiple of five, the SIMD 
processor 1200 includes five processing elements PE0 ~ PE4. The bit widths of 
instructions and data are 16. The first processing element PE0 (1201) controls the 
other four processing elements PE1 ~ PE4 and, in addition, processes scalar 
operations. The first processing element PE0 (1201) fetches, decodes, and executes 
instructions, including control and multi-cycle scalar instructions, while the 
other four processing elements PE1 ~ PE4 only execute SIMD instructions. The 
instruction corresponding to the program counter (PC) 1210 is fetched from the 
instruction memory 1227 and temporarily stored in the instruction register (IR) 1213. 
The fetched instruction is decoded by the decoder 1214, and the decoded instruction 
is then executed accordingly in each of the processing elements PE0 ~ PE4; for 
example, each ALU 1221 executes an add operation. After execution of the instruction 
is completed, the program counter (PC) 1210 is incremented by the PC controller 1211 
and a new instruction is fetched from the instruction memory 1227. The other four 
processing elements PE1 (1203), PE2 (1205), PE3 (1207), PE4 (1209) execute SIMD 
instructions. Each of the processing elements PE0 ~ PE4 includes a register block 
1215 for storing data for parallel operations. The register block 1215 includes 
vector registers VR0 ~ VR15 (1217). The register block 1215 of the first processing 
element PE0 also includes additional scalar registers 1219 to store scalar and 
control data. The second through fifth processing elements PE1 ~ PE4 include a 
register 1225 for temporarily storing the encoded instruction. A SIMD instruction is 
not executed in all processing elements at the same time, but is executed in one 
processing element after another so that a data memory port and an I/O port can be 
shared in a time-multiplexed fashion, which saves memory access power and provides a 
simple I/O interface. 

As mentioned before, the specialized SIMD processor instructions STOLT, 
SUBGE, and LOOP are employed to replace common sequences of three typical RISC 
instructions appearing in turbo interleaver programs. STOLT and SUBGE are SIMD 
instructions, whereas LOOP is a scalar control instruction, which is executed only 
in PE0. 

According to at least one embodiment of the present invention, the two stages 
of the block turbo interleaver, preprocessing and on-the-fly generation, are 
performed in the SIMD processor. The speed of on-the-fly generation improves 
dramatically with the SIMD processor because of its SIMD parallel processing 
capability and its support of the three special instructions. 

It should be noted that many variations and modifications may be made to the 
embodiments described above without substantially departing from the principles of 
the present invention. All such variations and modifications are intended to be 
included herein within the scope of the present invention, as set forth in the 
following claims. 